Final PR-1032-77-418 
April 29, 1977 




ACALP ALGORITHM SELECTION AND APPLICATION 




Moshe Ben-Bassat 
Antonio Leal 




Prepared For: 




TRW Systems Group 
One Space Park 
Redondo Beach, California 90298 




PERCEPTRONICS 


6271 VARIEL AVENUE • WOODLAND HILLS • CALIFORNIA 91367 • PHONE (213) 884-7470




TABLE OF CONTENTS 


1. INTRODUCTION

   1.1 Overview
   1.2 The Problem
   1.3 The Classification Process


2. SYSTEM DESCRIPTION

   2.1 Deterministic Rules (Level I)

   2.2 Nearest Neighbor Classification (Level II)

       2.2.1 Performance Considerations
       2.2.2 Distance Functions
       2.2.3 The Rejection Option
       2.2.4 Distance-Weighted k-NN Rule
       2.2.5 The Edited k-NN Method
       2.2.6 Implementation

   2.3 Classification by Clustering (Level III)

       2.3.1 The Clustering Algorithm
       2.3.2 Learning in Real Time


3. EVALUATION

   3.1 Overview
   3.2 Approach


REFERENCES




1. INTRODUCTION 
1.1 Overview 


This report describes a pattern recognition and learning system which 
can be applied to several Ballistic Missile Defense (BMD) control problems, 
as well as many problems in other disciplines. The basic problem is
concerned with the recognition of objects as either belonging to one of 
a given set of classes or possibly as a new unforeseen type. In the 
latter case, the properties of the new type need to be learned by the 
computer. This classification and learning must be performed in real 
time and under time constraints. For the sake of clarity and to motivate 
the algorithm development, it is described as applied to the recognition 
of a re-entry vehicle threat cloud which is approaching a friendly target. 


1.2 The Problem


It is assumed that objects in a threat cloud composed of RV's
(Reentry Vehicles) and non-RV's are to be classified and distinguished as
to type. The known classes include several types of RV's, decoys (DC),
and tank fragments (TF), and will be denoted henceforth by $C_1, C_2, \ldots, C_m$.
It is also possible for the cloud to contain objects which are not members
of any of the known classes.


The classification process is dynamic. The cloud is observed
continuously and the collected data is analyzed at discrete points of time,
$t_1, t_2, \ldots, t_F$, where $t_F$ denotes the latest time for a defense action.
(See Figure 1-1.) Final classification of any of the objects can be made
if sufficient evidence has been accumulated. If an object is identified
as not belonging to one of the known classes and temporal considerations
allow, then the characteristics of this new type are learned so that similar
objects may be more rapidly and accurately classified. A decision is also
made whether the new type is more likely to be an RV or a decoy. At time
$t_F$ all the objects need to be classified into either a known class or a
new one. A "no decision" is not allowed.


Information features which may be used for classification include:
trajectory motion, radiant intensity and scintillation, reflected sunlight,
temperature history, tumbling frequencies, etc. The pattern recognition
model assumes that the preprocessing required to extract the classificatory
features has been made and that at a given point of time, each object is
represented by a pattern of its features $(x_1, x_2, \ldots, x_n)$. This pattern
can represent a single observation or an average of continuous observations
over a period of time. No assumption is made with regard to the distribution
of these features.


Object classification is mainly based on its feature pattern. However,
for boundary cases, classification can be assisted by any holistic information
that may be available. For instance, intelligence reports may provide
information that a given RV cloud contains at most three RV's. In this
case, if three RV's have already been identified, then a boundary case is
more likely to be a non-RV. Other holistic information can be derived
from physical characteristics of the whole cloud, for instance, a cloud of
volume (weight) V cannot carry more than x RV's. Clearly, the reliability
of holistic information plays a prominent role in determining the rules
which are derived from it.


The main characteristics of this object classification problem can
be summarized as follows:


1. Classification of many objects in real time.
2. Classification as early as possible, when the data is of
   adequate quality for a decision rather than after a
   predetermined amount of data has been collected.
3. Recognition of new classes.
4. Learning new class characteristics in real time.


1.3 The Classification Process


This section describes the top level structure of the pattern
recognition and learning algorithm. Following sections describe the 
subfunctions of the algorithm in greater detail. 


The three-level system described below and shown in Figure 1-2
processes and classifies one object at a time as each sensor data pattern
is presented to it. The pattern submitted at each point of time may either
be independent or an average of all the patterns previously observed for
this object. Outside of this hierarchical structure is a more general
information processing loop that uses any holistic information that may be
available to aid the single object classification.


Individual object classification is made in a hierarchical process
composed of three main levels:


Level I    Immediate classification by deterministic rules;
Level II   Classification by the nearest neighbor rule;
Level III  Final classification of rejected objects.


During Level I processing, predefined deterministic rules are
evaluated in order to classify objects with an obvious and certain identity.
For example, if the absolute temperature of an object is below a given
threshold, it may be possible to classify it immediately as a non-RV. If
this level does not lead to a clear cut decision, the information collected
is retained for the next process levels. Notice that if all the classes
are eliminated by these rules except for one known class, it does not imply
that the object belongs to this single remaining class, since it is possible
that the object could be of a new type.


In Level II, the object is either finally classified by the (k, k*)
nearest neighbor rule or rejected and submitted for the third classification
level. Again, the information accumulated up to this point is retained 
for the next level. 


Finally, the first task at Level III is to classify the object as 
an RV or non-RV. The second task is to identify whether this object is 
one of the old types or is a new type. Both tasks are carried out via 
a clustering approach.


Figure 1-3 specifies the possible decision outcomes at each level
of the algorithm. The following sections will describe the operation of
each level individually and their interrelationships.


2. SYSTEM DESCRIPTION


2.1 Deterministic Rules (Level I) 


Some objects can easily be identified as RV's or non-RV's, or
as a certain type of RV or non-RV. In these cases, it is possible to
simplify the pattern recognition process by using deterministic rules. A
deterministic rule might state that if feature X is not in the range [a,b],
then the object cannot belong to class $C_i$. In our example, the corresponding
rule might state that an object colder than 270°K or hotter than 400°K is not
an RV. Such rules facilitate and accelerate the classification of obvious
cases by using detailed knowledge about the specific problem. Because
deterministic rules are inflexible, they must be carefully prescribed to
avoid sensitivity to "spoofing". The rule given above could be spoofed by
a heated RV, for example, at a non-trivial performance penalty. For this
reason, only the most certain deterministic rules should be used. A more
effective deterministic rule would classify stationary objects as non-RV's
and be unspoofable. Because objects which are not classified by the
deterministic rules will be classified at a later stage, only the clearest
cases should be classified at this point. Figure 2-1 presents the structure
of the Level I process.


If, at the end of Level I, only one admissible class remains, that
is, all other classes have been eliminated, the system proceeds straight
to Level III. Otherwise, it proceeds to Level II.
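The Level I logic can be sketched as a table of range rules that eliminate classes, followed by the routing decision described above. This is only an illustrative sketch: the rule table, the feature name `temperature_K`, the decoy window, and the function names are assumptions; only the 270 to 400 degree window for RV's comes from the example in the text.

```python
# Hypothetical rule table: class -> {feature: (low, high)}.
# Only the 270-400 K window for RVs comes from the report's example;
# the decoy window and the feature name are made up for illustration.
RULES = {
    "RV": {"temperature_K": (270.0, 400.0)},
    "DC": {"temperature_K": (150.0, 350.0)},
}


def admissible_classes(pattern, rules):
    """Return the known classes that no range rule eliminates."""
    remaining = []
    for cls, ranges in rules.items():
        in_range = all(
            low <= pattern[feature] <= high
            for feature, (low, high) in ranges.items()
            if feature in pattern  # a missing feature eliminates nothing
        )
        if in_range:
            remaining.append(cls)
    return remaining


def next_level(remaining):
    """Exactly one surviving class goes to Level III (the object may still
    be a new type); every other case goes to Level II, as the text states."""
    return "III" if len(remaining) == 1 else "II"


survivors = admissible_classes({"temperature_K": 360.0}, RULES)
print(survivors, next_level(survivors))
```

Keeping the rules as data rather than code makes it easy to restrict Level I to the few rules considered unspoofable.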


2.2 Nearest Neighbor Classification (Level II)


Any classification method which can incorporate a rejection option
is suitable for Level II. With the rejection option, whenever a criterion
for incorrect classification (e.g., probability of error) is not satisfied,
classification is deferred. This includes parametric or nonparametric
discriminant function methods, nearest neighbor methods, or any other ad
hoc method. In order not to confine the algorithm to specific assumptions
on the nature of data, the (k, k*) nearest neighbor method is adopted.


In the simplest form of the nearest neighbor (NN) method, when a
new pattern arrives, its distance from each of the preclassified training
patterns is calculated, and it is assigned to the class of its nearest
neighbor. Improved versions of this approach are obtained by
considering the distances from the k nearest neighbors and using a majority
rule. The basic structure of the k-NN method is as follows:


Input

   NC = number of classes
   NT = number of training patterns
   NF = number of features (dimension of patterns)
   $T = (X^1, X^2, \ldots, X^{NT})$   set of training patterns
   $L = (\ell_1, \ell_2, \ldots, \ell_{NT})$   labels of training patterns
   X  = unknown pattern
   d  = distance function
   k  = majority parameter

Procedure

   (1) Compute $d(X, X^j)$ for j = 1, 2, ..., NT
   (2) Identify $T_k = (j_1, j_2, \ldots, j_k)$, the indices of the k nearest neighbors,
       and $L_k = (\ell_{j_1}, \ell_{j_2}, \ldots, \ell_{j_k})$, the labels of the k-NN
   (3) Count $N_i$, the occurrence of class i in $L_k$
   (4) Assign X to class $c^*$ where $N_{c^*} = \max(N_1, \ldots, N_{NC})$
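The four steps of the basic k-NN procedure can be sketched in a few lines. The function names and the Euclidean default distance are assumptions made for illustration; the report itself uses a covariance-weighted distance (Section 2.2.2).

```python
import math
from collections import Counter


def euclidean(a, b):
    """Plain Euclidean distance, used here only as a stand-in metric."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def knn_classify(x, training, labels, k, dist=euclidean):
    """Basic k-NN majority rule; returns (winning class, vote counts)."""
    # (1) distance from x to every training pattern, (2) k nearest indices
    order = sorted(range(len(training)), key=lambda j: dist(x, training[j]))
    nearest = order[:k]
    # (3) count the occurrences of each class among the k labels
    counts = Counter(labels[j] for j in nearest)
    # (4) pick the class with the most votes (ties go to the class
    #     encountered first among the nearest neighbors)
    winner, _ = counts.most_common(1)[0]
    return winner, counts


training = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
labels = ["RV", "RV", "RV", "DC", "DC"]
print(knn_classify((0.2, 0.2), training, labels, k=3))
```

The full sort is the simplest choice for a small training set; the efficient search algorithms cited in Section 2.2.1 would replace it for large ones.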
2.2.1 Performance Considerations. Nearest neighbor techniques are
attractive due to the asymptotic relationship between the expected probability
of error of the NN classifier ($P_n$) and that of the optimal Bayes classifier ($P^*$).
For a large set of training patterns and k = 1, the following inequality
holds (Cover and Hart, 1967):

$$P^* \le P_n \le P^* \left( 2 - \frac{NC}{NC - 1} P^* \right)$$

Roughly speaking, the significance of this inequality is that $P_n$ is at most
twice $P^*$ for a large set of training patterns. For k > 1, tighter bounds are
obtained.


The main disadvantage of the nearest neighbor methods is that the 
training patterns are stored in the classifier and used in the classification 
phase which implies computational difficulties for a large training set. 
On the other hand, for a limited training set, the attractive asymptotic 
properties of NN rules may not hold. Hence, the relationship between the 
size of the training set and the level of performance of NN rules is crucial. 
Kanal (1974) reviews recent literature on this subject. Algorithms for
efficient search of the k nearest neighbors are described by Fukunaga and
Narendra (1975); Yunck (1976); and Friedman, Baskett, and Shustek (1977). A
survey of these techniques is given by Bentley (1975).


2.2.2 Distance Functions. Any metric can be used as a distance function.
The distance function most commonly used belongs to the Minkowski family
of metrics:

$$d(X^i, X^j) = \left( \sum_{f=1}^{NF} |x^i_f - x^j_f|^p \right)^{1/p}, \qquad p \ge 1$$

When the feature scales are not of the same magnitude and correlated features
exist, a weighted distance function is used. For that purpose the covariance
matrix is usually used and a typical distance function is:

$$d(X^i, X^j) = (X^i - X^j)^T \Sigma^{-1} (X^i - X^j)$$

where $\Sigma$ is the feature covariance matrix over the whole set of training
patterns. Currently the latter distance function is being used.
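A minimal sketch of the covariance-weighted distance follows, written without external libraries. Rather than inverting the covariance matrix, it solves a linear system by Gaussian elimination; the helper names `covariance`, `solve`, and `weighted_distance` are illustrative assumptions, not part of the original system.

```python
def covariance(patterns):
    """Unbiased sample covariance matrix of a list of feature vectors."""
    n, nf = len(patterns), len(patterns[0])
    mean = [sum(p[f] for p in patterns) / n for f in range(nf)]
    return [
        [sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in patterns) / (n - 1)
         for b in range(nf)]
        for a in range(nf)
    ]


def solve(matrix, rhs):
    """Solve matrix * z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def weighted_distance(a, b, cov):
    """(a - b)^T cov^{-1} (a - b), i.e. the squared Mahalanobis form."""
    diff = [u - v for u, v in zip(a, b)]
    z = solve(cov, diff)  # z = cov^{-1} (a - b)
    return sum(d * w for d, w in zip(diff, z))


print(weighted_distance((2.0, 1.0), (0.0, 0.0), [[4.0, 0.0], [0.0, 1.0]]))
```

With the identity matrix as `cov` the function reduces to the squared Euclidean distance, which is a quick sanity check on any implementation.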


2.2.3 The Rejection Option. When the highest $N_i$ values are very close to
each other, classifying X to $c^*$ involves high risk of misclassification.
To avoid this, the rejection option is used, by which high potential of
misclassification is converted into rejection, according to the following
rule:


Let k* denote a threshold majority level. Then, if $N_{c^*} \ge k^*$,
assign X to $c^*$; otherwise reject X. The value of k* is determined
experimentally based on our tolerance level for misclassification.


It has been proven theoretically and experimentally that this
modified NN rule, which will be referred to as the "(k, k*)-NN rule", does
improve the performance level of the classifier. Exact relations between
the reject rate and the error rate can be found in Hellman (1970) and
Devijver (1976).
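The (k, k*) decision itself reduces to a vote count followed by a threshold test. The sketch below assumes the labels of the k nearest neighbors have already been found, and treats a vote count equal to k* as sufficient; the function name and the use of `None` to signal rejection are illustrative choices.

```python
from collections import Counter


def kk_star_decision(neighbor_labels, k_star):
    """Return the majority class if it has at least k_star votes, else None.

    None stands for "reject": the pattern is passed on to the next level.
    """
    counts = Counter(neighbor_labels)
    winner, votes = counts.most_common(1)[0]
    return winner if votes >= k_star else None


labels_of_5_nn = ["RV", "RV", "DC", "RV", "DC"]
print(kk_star_decision(labels_of_5_nn, k_star=3))  # majority is large enough
print(kk_star_decision(labels_of_5_nn, k_star=4))  # too close: rejected
```

Raising k* trades a lower error rate for a higher reject rate, which is the relation studied in the references cited above.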


2.2.4 Distance-Weighted k-NN Rule. A variant of the k-NN method is based
on the argument that it appears more reasonable "to weight the evidence
of a neighbor close to an unclassified observation more heavily than the
evidence of another neighbor which is at a greater distance from the
unclassified observation" (Dudani, 1976). Pursuing this idea, the following
distance-weighted k-NN procedure is used:


(1) Determine the k nearest neighbors $T_k = \{j_1, \ldots, j_k\}$ and their
    corresponding labels $L_k = \{\ell_{j_1}, \ldots, \ell_{j_k}\}$, where
    $X^{j_1}$ is the closest one while $X^{j_k}$ is the farthest. Let
    $d_1 \le d_2 \le \ldots \le d_k$ be their corresponding distances from X.


(2) For each one of the k-NN compute the weight:

$$w_j = \begin{cases} \dfrac{d_k - d_j}{d_k - d_1}, & d_k \ne d_1 \\ 1, & d_k = d_1 \end{cases}$$


(3) Count $N_i$, the occurrence of class i in $L_k$.


(4) For every class i, sum up the weights of the $N_i$ neighbors
    from that class.


(5) Assign X to the class with the highest total weight.


Experiments with the distance-weighted k-NN rule reveal better
performance (Dudani, 1976). Theoretical proof has not yet been reported.


A possible rejection criterion for this method is the following:
Reject the pattern from classification if the highest total weight is
less than a predetermined threshold. This criterion is proposed here for
the first time and has not yet been examined.
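The weighting scheme of Dudani (1976) and the proposed rejection criterion can be sketched as follows. The closest neighbor receives weight 1, the farthest weight 0, and the others are interpolated linearly. The function names and the `reject_below` parameter are illustrative assumptions; the threshold value itself would have to be chosen experimentally.

```python
def dudani_weights(distances):
    """Weights for neighbor distances sorted in ascending order."""
    d_first, d_last = distances[0], distances[-1]
    if d_last == d_first:  # all neighbors equally distant
        return [1.0] * len(distances)
    return [(d_last - d) / (d_last - d_first) for d in distances]


def weighted_vote(neighbors, reject_below=None):
    """neighbors: list of (distance, label) pairs for the k nearest neighbors.

    Returns (winning class or None if rejected, total weight per class).
    """
    ordered = sorted(neighbors, key=lambda pair: pair[0])
    weights = dudani_weights([d for d, _ in ordered])
    totals = {}
    for (_, label), w in zip(ordered, weights):
        totals[label] = totals.get(label, 0.0) + w
    winner = max(totals, key=totals.get)
    if reject_below is not None and totals[winner] < reject_below:
        return None, totals
    return winner, totals


# A simple majority vote would pick "DC" here; the weighted vote picks "RV".
print(weighted_vote([(1.0, "RV"), (2.0, "DC"), (3.0, "DC")]))
```

The example shows how a single very close neighbor can outweigh two distant ones, which is the behavior the distance-weighted rule is designed to produce.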


2.2.5 The Edited k-NN Method. Wilson (1972) proposed the following
method for cleaning the training set of "poor" patterns. A poor pattern
is one that lies in a region where most of the other training patterns belong
to a class different from the class of this particular pattern.


For each i, use $\{X^1, \ldots, X^{i-1}, X^{i+1}, \ldots, X^{NT}\}$ as the training set
and classify $X^i$ by the k-NN method. If $X^i$ is correctly classified, proceed
to the next i. If $X^i$ is misclassified, delete it from the training set and
then proceed to the next i with the reduced set.
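The editing pass described above can be sketched as a single loop over the training set. The sketch follows the sequential variant stated in the text, where a deleted pattern no longer votes for later patterns. The function name is an assumption, and Euclidean distance (`math.dist`, Python 3.8 or later) is used only as a stand-in metric.

```python
import math
from collections import Counter


def edit_training_set(patterns, labels, k):
    """Remove patterns misclassified by k-NN on the rest of the (reduced) set."""
    kept = list(range(len(patterns)))
    for i in range(len(patterns)):
        others = [j for j in kept if j != i]
        if not others:
            continue  # nothing left to vote
        others.sort(key=lambda j: math.dist(patterns[i], patterns[j]))
        votes = Counter(labels[j] for j in others[:k])
        predicted = votes.most_common(1)[0][0]
        if predicted != labels[i]:
            kept.remove(i)  # later patterns are classified with the reduced set
    return [patterns[j] for j in kept], [labels[j] for j in kept]


patterns = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
            (5, 5), (5, 6), (6, 5), (6, 6)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B", "B"]
edited_patterns, edited_labels = edit_training_set(patterns, labels, k=3)
print(len(edited_patterns), edited_labels)
```

In this example the stray "B" pattern sitting inside the "A" cluster is the only one removed.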


In addition to reducing the storage required for future classification
of unknown patterns, the Edited k-NN method has better asymptotic performance
than the k-NN rule (Wilson, 1972; Wagner, 1973). Although a recent
paper by Penrod and Wagner (1977) points out an error in Wilson's proof,
it appears that, except for pathological cases, better performance should
be anticipated with this method.


Tomek (1976) raises the following interesting question: Let T'
denote the training set obtained from T by applying Wilson's method for
editing. "It is natural to ask what would similar editing of T' (leading to
T'', T''', etc.) do to the design set. Should we expect progressively better
and better classification or will editing of experiments indicate that the
described method indeed improves the performance of k-NN classification
considerably ..." (Tomek, 1976). However, Tomek was unable to prove anything
about the extension of Wilson's result to repeated editing.


2.2.6 Implementation. Consider the two-dimensional situation illustrated
in Figure 2-2. For this case, all the k nearest neighbors of X belong to
$C_2$ and hence, the (k, k*) NN rule, if applied, would assign X to $C_2$.
However, it is clear in this case that a better decision would be to
consider X as a new class type. To cope with situations when the set of
classes is not exhaustive, the following modification to the NN rule is
used:

(1) Set a boundary threshold $D_k$ on each class k, k = 1, 2, ..., m
(2) Compute $d(X, X^j)$ for j = 1, 2, ..., NT
(3) If $d(X, X^j) > D_k$ where $X^j \in C_k$, do not count $X^j$ as a candidate
    for the set of k nearest neighbors. Otherwise record $X^j$.
(4) If k nearest neighbors cannot be found, reject X
(5) If k nearest neighbors can be found but $N_{c^*} < k^*$, reject X
(6) If k nearest neighbors can be found and $N_{c^*} \ge k^*$, assign X to $c^*$


FIGURE 2-2. TWO DIMENSIONAL CASE
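The modified rule can be sketched by filtering the training patterns with their class boundary before the usual (k, k*) vote. The function name, the `bounds` dictionary keyed by class label, and the Euclidean stand-in distance are assumptions for illustration; rejection is signalled by `None`.

```python
import math
from collections import Counter


def bounded_kk_star(x, training, labels, k, k_star, bounds, dist=math.dist):
    """(k, k*) NN rule that ignores neighbors beyond their class boundary."""
    candidates = []
    for pattern, label in zip(training, labels):
        d = dist(x, pattern)
        if d <= bounds[label]:  # step (3): keep only patterns within D_k
            candidates.append((d, label))
    if len(candidates) < k:  # step (4): not enough admissible neighbors
        return None
    candidates.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in candidates[:k])
    winner, count = votes.most_common(1)[0]
    return winner if count >= k_star else None  # steps (5) and (6)


training = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = ["C2", "C2", "C2", "C2"]
bounds = {"C2": 2.0}
print(bounded_kk_star((0.4, 0.4), training, labels, 3, 3, bounds))    # inside
print(bounded_kk_star((10.0, 10.0), training, labels, 3, 3, bounds))  # far away
```

The second call reproduces the situation of Figure 2-2: every nearest neighbor belongs to one class, yet the pattern is rejected because it lies outside that class's boundary.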
The method for setting the boundaries $D_k$ should be determined ad
hoc based on the knowledge of the class structures and the misclassification
tolerance level. An adequate Monte-Carlo procedure works as follows:


(1) For each pair of training patterns $X^i$, $X^j$ in $C_k$ compute
    $d(X^i, X^j)$
(2) Determine the value $D_k$ for which $\beta$ percent (e.g., $\beta$ = 95%) of
    the pairs satisfy $d(X^i, X^j) < D_k$
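One way to realize this procedure is to take an empirical quantile of the within-class pairwise distances. The sketch below enumerates all pairs instead of sampling them, and returns the smallest observed distance that at least a fraction `beta` of the pairs do not exceed; both choices, along with the function name, are assumptions rather than the report's exact procedure.

```python
import math
from itertools import combinations


def class_boundary(patterns, beta=0.95, dist=math.dist):
    """Empirical beta-quantile of pairwise distances within one class."""
    distances = sorted(dist(a, b) for a, b in combinations(patterns, 2))
    if not distances:
        raise ValueError("need at least two patterns in the class")
    # index of the smallest distance covering at least beta of the pairs
    index = max(0, math.ceil(beta * len(distances)) - 1)
    return distances[index]


one_class = [(0.0,), (1.0,), (3.0,), (6.0,)]
print(class_boundary(one_class, beta=0.5))
print(class_boundary(one_class, beta=0.95))
```

For a large class, sampling a random subset of pairs would keep the cost manageable, since the number of pairs grows quadratically with the class size.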


The principles of this procedure are closely related to the principles 
of complete-link and graph-theoretic clustering methods. Similar procedures 
can be developed following the single-link or k-means principles. Presently, 
the (k, k*) NN rule is used with the modification just described. This 
method is available to the user with various options for distance functions 
and the option to apply the edited nearest neighbor preprocessing. Patterns 
rejected by this (k, k*) NN method are submitted to Level III.


2.3 Classification By Clustering (Level III) 


2.3.1 The Clustering Algorithm. The classification method used in Level III
is clustering-oriented. It is assumed that the training patterns have passed
an imaginary clustering process which has generated M clusters corresponding
to the M known classes and assigned each of the training patterns to the
cluster it belongs to. When a new pattern arrives, the clustering algorithm
is "reactivated" and decides whether to assign this pattern to one of the
old classes or to classify it as a new type. Many clustering algorithms
can be found appropriate. The algorithm used in the current system is
described below. It is essentially the same as that proposed by MacQueen (1967) and
Sebestyen and Edie (1966).
Notation


   d     - distance function
   K(i)  - number of clusters at stage i


For cluster k at stage i denote:


   $n_k(i)$       - number of members
   $C_k(i)$       - centroid of the cluster (average of all the members)
   $\Sigma_k(i)$  - covariance matrix
   $A_k(i)$       - membership threshold

      $A_k(i) = f_1 \cdot \max \{ d(X^j, C_k(i)) \mid X^j \text{ in cluster } k \}$

   $B_k(i)$       - non-membership threshold

      $B_k(i) = f_2 \cdot A_k(i), \quad f_2 > 1$

   $f_1, f_2$     - predetermined factors


Initialization


   Let $X^1$ be the only member of cluster 1
   $C_1(1) = X^1$
   $\Sigma_1(1)$ = predetermined covariance matrix or the identity matrix
   $n_1(1) = 1$
   K(1) = 1


Step i


(1) For k = 1 to K(i) compute

    $$d[X^i, C_k(i)] = (X^i - C_k(i))^T \, \Sigma_k(i)^{-1} \, (X^i - C_k(i))$$

(2) Find

    $$d[X^i, C_\ell(i)] = \min_k \{ d[X^i, C_k(i)] \}$$

(3) If $A_\ell(i) < d[X^i, C_\ell(i)] < B_\ell(i)$, store $X^i$ for later processing


(4) If $d[X^i, C_\ell(i)] < A_\ell(i)$, assign $X^i$ to cluster $\ell$
    and update, with $n = n_\ell(i+1)$:

    $$n_\ell(i+1) = n_\ell(i) + 1$$
    $$C_\ell(i+1) = [(n-1) \, C_\ell(i) + X^i] / n$$
    $$E = \frac{1}{n} \left[ (n-2) \, \Sigma_\ell(i) + (n-1) \, C_\ell(i) \, C_\ell(i)^T + X^i (X^i)^T \right]$$
    $$\Sigma_\ell(i+1) = \frac{n}{n-1} \left[ E - C_\ell(i+1) \, C_\ell(i+1)^T \right]$$

    ($C_\ell$ and $\Sigma_\ell$ are unbiased estimators for the true mean and
    covariance, respectively)


(5) If $B_\ell(i) < d[X^i, C_\ell(i)]$, use $X^i$ to establish a new cluster,
    say cluster N, and update:

    $$K(i+1) = K(i) + 1$$
    $$C_N(i+1) = X^i$$
    $$\Sigma_N(i+1) = \frac{1}{N-1} \left[ \Sigma_1(i) + \Sigma_2(i) + \ldots + \Sigma_{N-1}(i) \right]$$
    $$A_N(i+1) = \frac{1}{N-1} \left[ A_1(i) + A_2(i) + \ldots + A_{N-1}(i) \right]$$
    $$B_N(i+1) = f_2 \cdot A_N(i+1)$$


i = i + 1
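The core of a clustering step is the three-way decision on the distance to the nearest cluster and, when the pattern is assigned, the recursive update of that cluster's centroid and covariance. The sketch below is illustrative only: the function names are assumptions, the behavior when the distance exactly equals a threshold is an arbitrary choice, and the covariance update uses the updated centroid in its last term, which is what makes the result equal to the unbiased sample covariance.

```python
def decide(distance, a_threshold, b_threshold):
    """Three-way outcome of a clustering step for the nearest cluster."""
    if distance <= a_threshold:
        return "assign"
    if distance > b_threshold:
        return "new cluster"
    return "defer"


def outer(u, v):
    """Outer product u v^T as a nested list."""
    return [[a * b for b in v] for a in u]


def update_cluster(n_old, centroid, cov, x):
    """Add pattern x to a cluster; return the new (n, centroid, covariance)."""
    n = n_old + 1
    new_centroid = [((n - 1) * c + xi) / n for c, xi in zip(centroid, x)]
    cc_old = outer(centroid, centroid)
    cc_new = outer(new_centroid, new_centroid)
    xx = outer(x, x)
    dim = len(x)
    new_cov = [[0.0] * dim for _ in range(dim)]
    for a in range(dim):
        for b in range(dim):
            # E is the mean of x x^T over all n members of the cluster
            e = ((n - 2) * cov[a][b] + (n - 1) * cc_old[a][b] + xx[a][b]) / n
            new_cov[a][b] = n / (n - 1) * (e - cc_new[a][b])
    return n, new_centroid, new_cov


# Start from a single member, as in the initialization step.
n, centroid, cov = 1, [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
for point in [(2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]:
    n, centroid, cov = update_cluster(n, centroid, cov, point)
print(n, centroid, cov)
```

Note that when the second member is added the factor (n - 2) is zero, so the predetermined initial covariance drops out and the estimate depends on the data alone.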


The clusters of the training patterns are used to determine the
corresponding centroids, covariance matrices, and $A_k$, $B_k$ thresholds. A newly
arriving pattern X is assigned to class $C_k$ if k is the nearest cluster and
$d(X, C_k) < A_k$. If $d(X, C_k) > B_k$ then X constitutes the nucleus for a new
class. If $A_k < d(X, C_k) < B_k$, classification of X is deferred to a later
stage. In between stages more observations on X are taken, and the learning
process is activated (Section 2.3.2).


Once a new class is established, its classification as a threat
or non-threat is determined as follows:


(1) Compute the distance from the new class centroid to the nearest
    threat class centroid, $d_t$, and to the nearest non-threat
    class centroid, $d_n$.


(2) Compute $d_t / d_n$.

(3) If $d_t / d_n < R_2 / R_1$, classify as threat;
    if $d_t / d_n > R_2 / R_1$, classify as non-threat,

    where:


    $R_1$ - the risk involved in classifying as threat
            when in fact it is a non-threat.


    $R_2$ - the risk involved in classifying as non-threat
            when in fact it is a threat.
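The threat decision for a new class is a comparison between a distance ratio and a risk ratio. The sketch below assumes the reading in which the threshold is the risk of a missed threat divided by the risk of a false threat, so that a costlier missed threat makes the "threat" label more likely. The function and parameter names are assumptions, and the case of exact equality (which the rule leaves open) is assigned to "non-threat" arbitrarily.

```python
def classify_new_class(d_threat, d_non_threat, risk_false_threat, risk_missed_threat):
    """Label a new class by comparing a distance ratio with a risk ratio.

    d_threat, d_non_threat: distances from the new centroid to the nearest
    threat and non-threat class centroids.
    """
    ratio = d_threat / d_non_threat
    threshold = risk_missed_threat / risk_false_threat
    return "threat" if ratio < threshold else "non-threat"


# The new class is twice as far from the threat centroid, but missing a
# threat is taken to be four times as costly as a false alarm.
print(classify_new_class(2.0, 1.0, risk_false_threat=1.0, risk_missed_threat=4.0))
```

With equal risks the rule reduces to picking the nearer of the two centroids.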


2.3.2 Learning in Real Time. For a given object, the data which is
accumulated between $t_k$ and $t_{k+1}$ for k = 1, 2, ..., F-1 is used for updating the
pattern of this object. Let $X = (x_1, x_2, \ldots, x_n)$ denote the pattern at
time $t_k$. Then for each observation during the period $(t_k, t_{k+1})$, the
following transformation takes place:

$$x_j \leftarrow x_j + \frac{1}{S_j} \, (x'_j - x_j)$$

where $S_j$ denotes the total number of observations on feature $x_j$ including
the last one, and $x'_j$ is the last value observed. Note that at a given point
of time, the value of $S_j$ may be different for different j's. It represents
the number of times the $x_j$ sensor(s) were activated for this particular
object. In the present system, $S_j$ is assumed the same for every j and every
object.
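Read as a running average, the update keeps each feature equal to the mean of all values observed so far without storing them. The sketch below assumes that reading; the function name and arguments are illustrative.

```python
def update_feature(current, observed, count_including_new):
    """Running-average update of one feature value.

    count_including_new is the number of observations of this feature,
    counting the one being folded in.
    """
    return current + (observed - current) / count_including_new


value = 0.0  # the starting value is irrelevant once the first observation arrives
for count, observation in enumerate([2.0, 4.0, 6.0], start=1):
    value = update_feature(value, observation, count)
print(value)  # mean of the three observations
```

Because only the current value and the count are kept per feature, the update fits the real-time constraint regardless of how many observations accumulate.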


Since at time $t_F$ a classification decision has to be made, the
gap between $A_k$ and $B_k$ for k = 1, 2, ..., m needs to be closed by time $t_F$.
The magnitude $G_t$ by which $A_k$ is increased and $B_k$ is decreased at time t is
determined by:

$$G_t = \frac{B_k^0 - A_k^0}{2} \left[ \frac{t}{t_F} \right]^a \qquad \text{for } t = t_1, t_2, \ldots, t_F$$

where $A_k^0$ and $B_k^0$ denote the initial thresholds, so that the gap is
closed exactly at $t = t_F$. For every a > 0, $G_t$ is an increasing function
of t, which means that larger
pieces of the initial gap are cut as time progresses. For 0 < a < 1, $G_t$ is
concave while for a > 1, $G_t$ is convex. This means that for 0 < a < 1, the rate of
increase in $G_t$ is more moderate than for a > 1. See Figure 2-3 for
illustration.
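The closing of the gap can be sketched under one explicit assumption: the two thresholds move toward each other by the same amount, and that amount equals half of the initial gap at the final time, so that they meet exactly then. The scale factor, the function names, and the parameter names below are assumptions made for illustration.

```python
def gap_cut(t, t_final, initial_gap, a):
    """Amount by which each threshold has moved at time t (assumed scaling)."""
    return 0.5 * initial_gap * (t / t_final) ** a


def thresholds_at(t, t_final, a_initial, b_initial, a):
    """Membership and non-membership thresholds after the cut at time t."""
    g = gap_cut(t, t_final, b_initial - a_initial, a)
    return a_initial + g, b_initial - g


for exponent in (0.5, 2.0):
    # halfway to the deadline, then at the deadline itself
    print(exponent,
          thresholds_at(5.0, 10.0, 1.0, 3.0, exponent),
          thresholds_at(10.0, 10.0, 1.0, 3.0, exponent))
```

A small exponent closes most of the gap early, which forces early decisions; a large exponent keeps the deferral region wide until shortly before the deadline.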


FIGURE 2-3. CONVERGENCE OF $A_k$ AND $B_k$ (vertical axis: change factor)


3. EVALUATION 
3.1 Overview


The evaluation of the algorithm described in the previous sections 
was performed in six major phases. In each of the first three phases, one 
level of the algorithm was independently tested, while in the fourth phase 
the "learning in real time" module was evaluated. The purpose of those 
four phases was to test the validity and sensitivity of individual portions 
of the system and to decide upon the specific parameters to be used. For 
instance, the evaluation of Level II included experiments to determine 
the training set size, the distance function, the k,k* values, the effect 
of editing the training set, and the class boundary threshold. In the fifth
phase, the performance of our algorithm was compared with the performance
achievable with an existing RV discrimination system called "ODE".


For each phase, training and testing patterns were randomly generated
by a simulation process. The randomness of these patterns was controlled
to serve the special purposes of the particular experiment. For instance, in
the first phase, the random process was directed to generate many patterns
which can be classified by using deterministic rules. These patterns were
of little use for the second and third phases because, if the deterministic
rules are successful, these patterns would not reach later stages.


3.2 Approach 


The design of the evaluation plan is organized in two parts. The 
first part specifies the issues for examination, while the second part 
describes the actual experiments to be performed in order to test these 
issues. In general, each experiment was designed to test more than one 
issue. On the other hand, there are issues which are tested by more than 
one experiment. The design of the evaluation plan concludes with a cross 
reference table of issues to be tested against experiments to be performed. 


By this approach we first determine "where we want to go", and next, "how 
to get there". It enables experiments to be designed in such a way that 
the maximum information will be extracted. The list of issues tested and 
their corresponding experiments are described in the TRW report (CDRL 
Item A005, 17 January 1977).